Organizing Academic Papers Using Ward Clustering

For my second-year qualifying exams, I'm given academic papers to read by my three committee members. Each committee member has a different area of expertise (Computational Methods in Social Neuroscience, Memory, and Social Interaction and Faces), which lets me explore a wide breadth of my interests.

With approximately 20 papers per professor, I was overwhelmed by how to organize all the content I was given. While some members were kind enough to organize their papers for me, others left it to me to make meaning out of what I was given. Interestingly, there was some overlap between topics, and even some overlap between the suggested papers themselves, which made me want to organize all of them together!

So, I turned to Python to do it for me. I was aided by Brandon Rose's online tutorial, Document Clustering with Python (http://brandonrose.org/clustering_mobile), which has a great in-depth explanation of everything he did.

To begin, I used a command-line tool from xpdf called pdftotext to convert my PDFs to text files. The basic command is just pdftotext file.pdf, but I sped things up with a for loop and fixed the encoding:

for file in *.pdf; do pdftotext -enc UTF-8 "$file"; done

I also did a quick renaming of the files, so I'd remember who recommended each one, using rename:

rename 's/^/Luke - /' *
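
If you'd rather stay in Python for this step, a rough equivalent with subprocess might look like this (assuming pdftotext is on your PATH and the PDFs live in ./pdf):

In [ ]:
# rough Python equivalent of the shell loop above (assumes pdftotext is installed)
import glob
import subprocess

for pdf in glob.glob('./pdf/*.pdf'):
    # pdftotext writes a .txt file with the same base name next to each PDF
    subprocess.run(['pdftotext', '-enc', 'UTF-8', pdf], check=True)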

In [51]:
import numpy as np
import pandas as pd
import nltk
import re
import os
import codecs
from sklearn import feature_extraction
import mpld3
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

First, I loaded my files and set up my corpus of text.

In [52]:
import glob, os

# Get all the text files in my pdf directory
path = './pdf'
text_files = [f for f in os.listdir(path) if f.endswith('.txt')]

text_files.sort()

# Load them into a list
paper_text = []
for file in text_files:
    with open(os.path.join(path, file), "r") as f:
        paper_text.append(f.read())

# Get just the paper names (remove .txt)
paper_names = [x[:-4] for x in text_files]

# Get which committee member recommended each file (the first word of the filename)
committee = [x.split()[0] for x in text_files]
committee_order = ['Luke', 'Thalia', 'Jeremy']

Then I used Brandon's code for nltk tokenization and stemming

(taken and modified from http://brandonrose.org/clustering_mobile)

In [53]:
# load nltk's English stopwords as variable called 'stopwords'
stopwords = nltk.corpus.stopwords.words('english')

# load nltk's SnowballStemmer as variabled 'stemmer'
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

# here I define a tokenizer and stemmer which returns the set of stems in the text that it is passed

def tokenize_and_stem(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems


def tokenize_only(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens

#not super pythonic, no, not at all.
#use extend so it's a big flat list of vocab

totalvocab_stemmed = []
totalvocab_tokenized = []
for i in paper_text:
    allwords_stemmed = tokenize_and_stem(i) #for each item in 'paper_text', tokenize/stem
    totalvocab_stemmed.extend(allwords_stemmed) #extend the 'totalvocab_stemmed' list
    
    allwords_tokenized = tokenize_only(i)
    totalvocab_tokenized.extend(allwords_tokenized)
    
    
from sklearn.feature_extraction.text import TfidfVectorizer

#define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                 min_df=0.2, stop_words='english',
                                 use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,4))

tfidf_matrix = tfidf_vectorizer.fit_transform(paper_text) #fit the vectorizer to paper_text
(Fitting the vectorizer raises a scikit-learn UserWarning that the custom stemming tokenizer turns some of the built-in English stop words into stems like 'becaus' and 'onli' that no longer match the stop word list.)
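
As a quick sanity check, you can peek at the shape of the resulting matrix (papers × stemmed n-gram features) and a few of the feature names, something like:

In [ ]:
# rows are papers, columns are the stemmed n-gram features kept by min_df / max_df
print(tfidf_matrix.shape)
print(tfidf_vectorizer.get_feature_names()[:10])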

Using Brandon's K-means code, I also checked that the topics seemed to cluster into meaningful categories

In [54]:
# (taken and modified from http://brandonrose.org/clustering_mobile)
from sklearn.cluster import KMeans

num_clusters = 10

km = KMeans(n_clusters=num_clusters)
km.fit(tfidf_matrix)

clusters = km.labels_.tolist()
terms = tfidf_vectorizer.get_feature_names()
vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index = totalvocab_stemmed)

#sort cluster centers by proximity to centroid
order_centroids = km.cluster_centers_.argsort()[:, ::-1] 

for i in range(num_clusters):
    print("Cluster %d words:" % i, end='')
    
    for ind in order_centroids[i, :10]: # print the 10 top terms for each cluster
        print(' %s' % vocab_frame.loc[terms[ind].split(' '), :].values.tolist()[0][0], end = ',')
    print() #add whitespace
    print() #add whitespace

    
Cluster 0 words: story, narrative, voxel, fig, event, scrambled, al., et, distance, semantically,

Cluster 1 words: al., et, dynamic, decoding, nodes, fluctuations, intact, fmri, preprint, connectivity,

Cluster 2 words: recall, movie, scenes, participant, items, listening, fig, narrative, audio, spoken,

Cluster 3 words: layers, training, sentence, language, neural, et, al., vector, embedding, transfer,

Cluster 4 words: reward, evolution, targets, actions, neuronal, construct, environment, transitions, q, prospect,

Cluster 5 words: face, familiar, et, al., person, identities, adapt, gobbini, biased, di,

Cluster 6 words: semantically, category, voxel, al., et, area, decoding, fmri, estimate, semantically,

Cluster 7 words: nodes, community, connectivity, k, scales, n, et, al., area, layers,

Cluster 8 words: social, interacting, conversation, actions, intentions, communication, mental, mirror, mind, musical,

Cluster 9 words: recall, event, topic, video, manuscript, boundary, episodes, trajectories, semantically, participant,

These track well with my understanding of how these texts should be categorized. Some are about face processing, some are about social interaction, and some deal with communication and memory.
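
To dig one level deeper than the top terms, something like the following lists which papers (and whose picks) landed in each K-means cluster, using the clusters, paper_names, and committee lists from above:

In [ ]:
# one row per paper: its K-means cluster and who recommended it
paper_clusters = pd.DataFrame({'paper': paper_names,
                               'committee': committee,
                               'cluster': clusters})
paper_clusters.sort_values('cluster')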

By creating a clustermap, I can also see whether these texts cluster in a meaningful way. You'll notice that one pair shows a similarity of 1, and that's because it's the exact same paper (a good sanity check). Down near the bottom, I've also noticed the three face identification papers all clump together as well!

In [55]:
from sklearn.metrics.pairwise import cosine_similarity
dist = 1 - cosine_similarity(tfidf_matrix)

sim = pd.DataFrame(cosine_similarity(tfidf_matrix))
sim.columns = text_files

sns.clustermap(sim.T, cmap='RdBu_r', figsize = (15,15))
Out[55]:
<seaborn.matrix.ClusterGrid at 0x1a250a46d0>
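
That similarity-of-1 pair can also be confirmed programmatically; a quick sketch that finds the most similar off-diagonal pair in sim might look like:

In [ ]:
# find the most similar pair of *different* files (the duplicated paper shows up here)
sim_vals = sim.values.copy()
np.fill_diagonal(sim_vals, 0)  # ignore each paper's similarity with itself
i, j = np.unravel_index(np.argmax(sim_vals), sim_vals.shape)
print(text_files[i], '<->', text_files[j], round(sim_vals[i, j], 3))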

Finally, I can get to my goal of categorizing these texts. I chose to use a dendrogram so I could see all the associations and manually pick how many categories I wanted (I ended up with 10).

In [56]:
from scipy.cluster.hierarchy import ward, dendrogram
import seaborn as sns
sns.set_style("dark")

linkage_matrix = ward(dist)  # define the linkage matrix using Ward clustering on the pre-computed cosine distances

fig, ax = plt.subplots(figsize=(8, 16)) # set size
dend = dendrogram(linkage_matrix, orientation="left", labels=paper_names, color_threshold=1.5, above_threshold_color='gray')


plt.tick_params(
    axis='x',           # changes apply to the x-axis
    which='both',       # both major and minor ticks are affected
    bottom=False,       # hide the ticks...
    top=False,
    labelbottom=False)  # ...and the tick labels


# Recolor the y-axis labels (the paper names) by committee member
ax = plt.gca()
ylbls = ax.get_ymajorticklabels()

my_palette = ['maroon', 'navy', 'darkgreen']
for lbl in ylbls:
    # each label starts with the recommender's name, e.g. "Luke - Some Paper"
    member = lbl.get_text().split(' ')[0]
    lbl.set_color(my_palette[committee_order.index(member)])
    
# save figure
plt.savefig('specialist_grouping.png', dpi=200, bbox_inches='tight')
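
To turn that manual cut into actual group assignments, scipy's fcluster can slice the same Ward tree into a fixed number of clusters; here is a minimal sketch, reusing linkage_matrix and paper_names from above:

In [ ]:
# cut the Ward tree into 10 flat clusters and list the papers in each group
from scipy.cluster.hierarchy import fcluster

assignments = fcluster(linkage_matrix, t=10, criterion='maxclust')
groups = pd.DataFrame({'paper': paper_names, 'group': assignments})
groups.sort_values('group')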